@iverase (Contributor) commented Aug 4, 2025

This commit adds on-heap bulk distance computations. In particular, it implements the methods `ESVectorUtil#squareDistanceBulk` and `ESVectorUtil#soarDistanceBulk`, each of which computes four distances in one method call. Microbenchmarks show a nice speedup, for example on AVX2:

Benchmark                                   (dims)   Mode  Cnt  Score   Error   Units
SquareDistanceBenchmark.soarDistance           384  thrpt    5  4.387 ± 0.167  ops/ms
SquareDistanceBenchmark.soarDistance           782  thrpt    5  1.952 ± 0.357  ops/ms
SquareDistanceBenchmark.soarDistance          1024  thrpt    5  1.658 ± 0.817  ops/ms
SquareDistanceBenchmark.soarDistanceBulk       384  thrpt    5  6.627 ± 0.292  ops/ms
SquareDistanceBenchmark.soarDistanceBulk       782  thrpt    5  3.577 ± 0.255  ops/ms
SquareDistanceBenchmark.soarDistanceBulk      1024  thrpt    5  3.171 ± 0.360  ops/ms
SquareDistanceBenchmark.squareDistance         384  thrpt    5  5.853 ± 0.519  ops/ms
SquareDistanceBenchmark.squareDistance         782  thrpt    5  2.844 ± 0.034  ops/ms
SquareDistanceBenchmark.squareDistance        1024  thrpt    5  2.515 ± 0.104  ops/ms
SquareDistanceBenchmark.squareDistanceBulk     384  thrpt    5  8.669 ± 1.235  ops/ms
SquareDistanceBenchmark.squareDistanceBulk     782  thrpt    5  4.012 ± 0.449  ops/ms
SquareDistanceBenchmark.squareDistanceBulk    1024  thrpt    5  3.360 ± 0.454  ops/ms
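
To illustrate the idea of computing four distances in one call, here is a minimal scalar sketch. It is not the code from this PR: the class name `BulkDistanceSketch` and the exact parameter list are assumptions made for illustration, and the real `ESVectorUtil` methods may have a different signature and use vectorized code paths. The sketch only shows the general shape: each query element is read once per iteration and feeds four independent accumulators.

```java
// Illustrative sketch only; NOT the ESVectorUtil implementation from this PR.
// Class name and method signatures are assumptions made for this example.
final class BulkDistanceSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * Computes the squared distances from q to v0..v3 in a single pass
     * and writes them to distances[0..3]. All vectors must have the
     * same length as q, and distances must have room for four values.
     */
    static void squareDistanceBulk(
        float[] q, float[] v0, float[] v1, float[] v2, float[] v3, float[] distances
    ) {
        float d0 = 0f, d1 = 0f, d2 = 0f, d3 = 0f;
        for (int i = 0; i < q.length; i++) {
            float qi = q[i]; // read the query element once, reuse it four times
            float diff0 = qi - v0[i];
            float diff1 = qi - v1[i];
            float diff2 = qi - v2[i];
            float diff3 = qi - v3[i];
            d0 += diff0 * diff0;
            d1 += diff1 * diff1;
            d2 += diff2 * diff2;
            d3 += diff3 * diff3;
        }
        distances[0] = d0;
        distances[1] = d1;
        distances[2] = d2;
        distances[3] = d3;
    }
}
```

In this sketch the four accumulators do not depend on each other, which is one plausible reason a bulk variant can outperform four separate calls; the numbers in the table above come from the PR's own benchmarks, not from this code.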

The commit updates k-means local to use these new methods, which speeds up indexing. For example, indexing 3 million GloVe vectors with 200 dimensions:

before:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf   3000008           39798                 81817             0

after:


index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf   3000008           29074                 56799             0

Or 1 million Cohere vectors with 1024 dimensions:

before:

index_name       index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------  ----------  --------  --------------  --------------------  ------------  
wiki1024en.docs         ivf   1000008           51073                106968             0

after:

index_name       index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------  ----------  --------  --------------  --------------------  ------------  
wiki1024en.docs         ivf   1000008           32692                 85660             0
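
As a rough illustration of how a k-means assignment step could use a bulk distance method, the sketch below scans centroids four at a time and handles any remainder with the scalar method. This is an assumption about the general pattern, not the code changed in this PR; the class `NearestCentroidSketch` and all method signatures are hypothetical.

```java
// Illustrative sketch only; NOT the k-means code changed in this PR.
// Class name and method signatures are assumptions made for this example.
final class NearestCentroidSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Squared distances from q to four vectors, written to distances[0..3]. */
    static void squareDistanceBulk(
        float[] q, float[] v0, float[] v1, float[] v2, float[] v3, float[] distances
    ) {
        float d0 = 0f, d1 = 0f, d2 = 0f, d3 = 0f;
        for (int i = 0; i < q.length; i++) {
            float qi = q[i];
            float diff0 = qi - v0[i];
            float diff1 = qi - v1[i];
            float diff2 = qi - v2[i];
            float diff3 = qi - v3[i];
            d0 += diff0 * diff0;
            d1 += diff1 * diff1;
            d2 += diff2 * diff2;
            d3 += diff3 * diff3;
        }
        distances[0] = d0;
        distances[1] = d1;
        distances[2] = d2;
        distances[3] = d3;
    }

    /** Returns the index of the centroid closest to vector, or -1 if there are none. */
    static int nearestCentroid(float[] vector, float[][] centroids) {
        int best = -1;
        float bestDistance = Float.MAX_VALUE;
        float[] distances = new float[4]; // scratch buffer reused across batches
        int i = 0;
        // Full batches of four centroids go through the bulk method.
        for (; i + 3 < centroids.length; i += 4) {
            squareDistanceBulk(
                vector, centroids[i], centroids[i + 1], centroids[i + 2], centroids[i + 3], distances
            );
            for (int j = 0; j < 4; j++) {
                if (distances[j] < bestDistance) {
                    bestDistance = distances[j];
                    best = i + j;
                }
            }
        }
        // Remaining centroids (fewer than four) use the scalar method.
        for (; i < centroids.length; i++) {
            float d = squareDistance(vector, centroids[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

The scalar tail loop keeps the sketch correct when the number of centroids is not a multiple of four.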

@elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) Aug 4, 2025

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@tteofili (Contributor) left a comment

LGTM

@john-wagster (Contributor) left a comment

lgtm

@iverase merged commit 6578b9e into elastic:main Aug 4, 2025
34 checks passed
@iverase deleted the distanceBulk branch August 4, 2025 11:32
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 5, 2025
…cking

* upstream/main: (26 commits)
  [Fleet] add privileges to `kibana_system` to read integrations data (elastic#132400)
  Add `TestEntitlementsRule` with support for dynamic entitled node paths for testing (elastic#132077)
  Reduce logging frequency for GCS per project clients (elastic#132429)
  Skip update/100_synthetic_source tests in yamlRestCompatTests (elastic#132296)
  Correct exception for missing nested path (elastic#132408)
  Fixing esql release tests elastic#132369 (elastic#132406)
  Adjust date docvalue formatting to return 4xx instead of 5xx (elastic#132414)
  Handle nested fields with the termvectors REST API in artificial docs (elastic#92568)
  Only collect bulk scored vectors when exceeding min competitive (elastic#132293)
  Fix release tests diskbbq update (elastic#132405)
  ESQL: Fix skipping of generative tests (elastic#132390)
  Short circuit failure handling in OIDC flow (elastic#130618)
  Small optimization in OptimizedScalarQuantizer by using mul instead of div (elastic#132397)
  Aggs: Add validation to Bucket script pipeline agg (elastic#132320)
  ESQL: Multiple parameters in ungrouped aggs (elastic#132375)
  ESQL: Explain test operators (elastic#132374)
  EQL: Deal with internally created IN in a different way for EQL (elastic#132167)
  Speed up hierarchical k-means by computing distances in bulk (elastic#132384)
  Reduce the number of fields per document (elastic#132322)
  Assert current thread in ESQL (elastic#132324)
  ...
Labels

>non-issue, :Search Relevance/Vectors (Vector search), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch), v9.2.0

4 participants